Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. Are they performing it exactly as they were told, or are they making some of the common mistakes? In this project, we aim to use data collected from a set of sensors to predict the manner in which someone performs a simple weight lifting exercise.

Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Four accelerometers were used: three on the participants themselves (on the belt, forearm, and arm) and one on the dumbbell. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

To solve this problem, we first applied Principal Component Analysis (PCA) to narrow the predictors down from 160 variables to 25 components. We then fit two multi-class classification models, Random Forest and K-nearest Neighbours, and selected Random Forest because of its higher accuracy.

Data Processing

i. Load data

training <- read.csv("pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("pml-testing.csv", na.strings = c("", "NA"))
# Results are hidden in the final output because they are too long
head(training,10)
dim(training)
## [1] 19622   160

Our observations: 1) The dataset is quite large and has a lot of variables. 2) Each sensor captures multiple datapoints, and some variables have a lot of NA values; they might not be useful, so we should consider excluding them. 3) Each variable name contains the location of the sensor providing the datapoint, which allows us to group variables together and possibly conduct a PCA to reduce dimensions.
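
As a quick check on this grouping, we can count how many column names contain each sensor location keyword (a small sketch using the loaded training data; output not shown):

# Sketch: number of columns mentioning each sensor location
sapply(c("belt", "_arm", "forearm", "dumbbell"),
       function(loc) sum(grepl(loc, names(training))))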

ii. Remove NA columns

# Results are hidden in the final output because they are too long
library(dplyr)
col_na_count <- training %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))
col_na_count

Here we have the count of NAs in each column. As we can see, for the columns that contain NAs, NA is the dominant value - there are 19216 NAs out of 19622 rows. Let's remove the columns where more than half of the rows are NA.
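
Before dropping them, a quick way to confirm the pattern is to look at the proportion of NA values per column (a sketch; output not shown):

# Sketch: proportion of NA values per column; affected columns are almost entirely NA
na_prop <- colMeans(is.na(training))
summary(na_prop[na_prop > 0])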

# Results are hidden in the final output because they are too long
training_filter <- training %>% 
    select(-which(col_na_count > 0.5 * nrow(training)))
head(training_filter)

Now we are down to 60 variables.

iii. Conduct PCA to reduce dimensions

The remaining predictors can be grouped based on their names. Let's group them and check the correlation among the variables within each group. First, let's take a look at the belt variables.

# Results are hidden in the final output because they are too long
training_belt <- training_filter %>% select(contains("belt"))
cor(training_belt)
head(training_belt)

As we can see, many variables are highly correlated, with correlations as strong as -0.99. We can use Principal Component Analysis to reduce the dimensions. Since the scales of these variables are very different and there are both positive and negative values, we will also need to center and scale the data.
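
To see the scale differences directly, we can compare the per-column standard deviations of the belt variables (a sketch; output not shown):

# Sketch: per-column standard deviations of the belt variables
round(sapply(training_belt, sd), 2)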

library(caret)

training_belt_pcamodel <- preProcess(training_belt, method = c("center", "scale", "pca"), thresh = 0.90)
training_belt_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 4 components to capture 90 percent of the variance

As we can see, we only need 4 components to preserve 90% of the variance. This will reduce the dimensionality of our data dramatically.
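
If we want to see how the original belt variables map onto these components, the loadings can be inspected as well (a sketch, assuming the caret preProcess object stores them in its rotation element; output not shown):

# Sketch: loadings of the retained principal components
round(training_belt_pcamodel$rotation, 2)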

# apply PCA model
training_belt_pca <- predict(training_belt_pcamodel, newdata = training_belt)
# rename the column names
names(training_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")

Next we will repeat the process for the remaining three groups of variables: forearm, arm, and dumbbell.

# Create model for forearm
training_forearm <- training_filter %>% select(contains("forearm"))
training_forearm_pcamodel <- preProcess(training_forearm, method = c("center", "scale", "pca"), thresh = 0.90)
training_forearm_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 8 components to capture 90 percent of the variance
# We need 8 components to preserve 90% of the variance
training_forearm_pca <- predict(training_forearm_pcamodel, newdata = training_forearm)
names(training_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5","forearm_pca6","forearm_pca7","forearm_pca8")
# Create model for arm
training_arm <- training_filter %>% select(contains("_arm"))
training_arm_pcamodel <- preProcess(training_arm, method = c("center", "scale", "pca"), thresh = 0.90)
training_arm_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 7 components to capture 90 percent of the variance
# We need 7 components to preserve 90% of the variance
training_arm_pca <- predict(training_arm_pcamodel, newdata = training_arm)
names(training_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5","arm_pca6","arm_pca7")
# Create model for dumbbell
training_dumbbell <- training_filter %>% select(contains("_dumbbell"))
training_dumbbell_pcamodel <- preProcess(training_dumbbell, method = c("center", "scale", "pca"), thresh = 0.90)
training_dumbbell_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 6 components to capture 90 percent of the variance
# We need 6 components to preserve 90% of the variance
training_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = training_dumbbell)
names(training_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5","dumbbell_pca6")
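
Since the same three steps (select the group, fit a preProcess model, apply it, and rename the components) are repeated for each sensor group, they could also be wrapped in a small helper. A sketch using the same caret and dplyr calls as above:

# Sketch: fit and apply a group-wise PCA model in one call
pca_group <- function(data, keyword, prefix, thresh = 0.90) {
  grp <- data %>% select(contains(keyword))
  pca_model <- preProcess(grp, method = c("center", "scale", "pca"), thresh = thresh)
  pcs <- predict(pca_model, newdata = grp)
  names(pcs) <- paste0(prefix, "_pca", seq_len(ncol(pcs)))
  list(model = pca_model, pcs = pcs)
}
# Example: belt_res <- pca_group(training_filter, "belt", "belt")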

Now let’s combine all the variables after PCA.

training_pca <- data.frame(user_name = training[, 2], classe = training$classe, training_belt_pca, training_arm_pca, training_forearm_pca, training_dumbbell_pca )

Now we have 27 variables: 25 principal components plus user_name and classe. Let's do some Exploratory Data Analysis.

Exploratory Data Analysis

table(training_pca$user_name, training_pca$classe)
##           
##               A    B    C    D    E
##   adelmo   1165  776  750  515  686
##   carlitos  834  690  493  486  609
##   charles   899  745  539  642  711
##   eurico    865  592  489  582  542
##   jeremy   1177  489  652  522  562
##   pedro     640  505  499  469  497

Each of the 6 participants performed all 5 classes of the exercise.

# plot classes vs users
library(plotly)
plot_ly(training_pca, x = ~classe, y = ~belt_pca1, color = ~user_name, type = "box") %>% 
  layout(title = "belt_pca1 for each class per user")

In this plot, we can see that the user is clearly a very important factor. Regardless of the class, the range of motion differs from user to user. This tells us user_name should be included as a predictor in our model.
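
The same user effect can be summarised numerically, for example by the median belt_pca1 per user (a sketch; output not shown):

# Sketch: median belt_pca1 per user shows clear user-level offsets
aggregate(belt_pca1 ~ user_name, data = training_pca, FUN = median)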

plot_ly(subset(training_pca, user_name == "adelmo"), x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~classe) %>% 
    layout(title = "Adelmo performing different classes")

In this plot, we only look at Adelmo's movements. We can see that: 1) each class has a different range of movement; 2) there might be some outliers, for example one data point has an extremely large dumbbell_pca1. However, because we only plotted 3 variables, it is hard to tell whether these are real outliers, so we decided to keep them.
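
One way to gauge how extreme such points are is to look at the upper quantiles of dumbbell_pca1 for this user (a sketch; output not shown):

# Sketch: upper quantiles of dumbbell_pca1 for Adelmo
adelmo <- subset(training_pca, user_name == "adelmo")
quantile(adelmo$dumbbell_pca1, probs = c(0.50, 0.95, 0.99, 1.00))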

Select models

First, let's split the training data into training and validation samples.

# Define train control for k fold cross validation
# train_control <- trainControl(method="cv", number= 5, savePredictions = TRUE)

set.seed(123)
index <- createDataPartition(training$classe, p = 0.6, list = FALSE)
pca_tra <- training_pca[index,]
pca_val <- training_pca[-index,]

This is a multi-class classification problem. Suitable candidates are tree-based models and k-nearest neighbours.

Model 1 - Random Forest

#Random Forest
model1 <- train(classe ~., data= pca_tra, method="rf")
pca_val$pred_rf <- predict(model1, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2216    8    3    2    3
##          B   56 1449   12    1    0
##          C    3   32 1322   11    0
##          D    6    0   62 1210    8
##          E    0    0   11    7 1424
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9674, 0.9749)
##     No Information Rate : 0.2907          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9637          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9715   0.9731   0.9376   0.9829   0.9923
## Specificity            0.9971   0.9891   0.9929   0.9885   0.9972
## Pos Pred Value         0.9928   0.9545   0.9664   0.9409   0.9875
## Neg Pred Value         0.9884   0.9937   0.9864   0.9968   0.9983
## Prevalence             0.2907   0.1898   0.1797   0.1569   0.1829
## Detection Rate         0.2824   0.1847   0.1685   0.1542   0.1815
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9843   0.9811   0.9652   0.9857   0.9948
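
Note that model1 above was trained with caret's default bootstrap resampling. The trainControl defined (but commented out) before the data split could be passed to train() to use explicit 5-fold cross-validation instead; a sketch of that call (not run for this report):

# Sketch (not run): Random Forest with explicit 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE)
model1_cv <- train(classe ~ ., data = pca_tra, method = "rf", trControl = train_control)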

Model 2 - K-nearest neighbours

#KNN Model
ctrl <- trainControl(method="repeatedcv",repeats = 3)
model2 <- train(classe ~., data= pca_tra, method="knn", trControl = ctrl)
pca_val$pred_knn <- predict(model2, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2182   24   15    9    2
##          B   81 1379   45   11    2
##          C   17   30 1283   29    9
##          D    5    4   68 1205    4
##          E    2   16   15   20 1389
## 
## Overall Statistics
##                                           
##                Accuracy : 0.948           
##                  95% CI : (0.9429, 0.9528)
##     No Information Rate : 0.2915          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9342          
##                                           
##  Mcnemar's Test P-Value : 1.762e-12       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9541   0.9491   0.8997   0.9458   0.9879
## Specificity            0.9910   0.9783   0.9868   0.9877   0.9918
## Pos Pred Value         0.9776   0.9084   0.9379   0.9370   0.9632
## Neg Pred Value         0.9813   0.9883   0.9779   0.9895   0.9973
## Prevalence             0.2915   0.1852   0.1817   0.1624   0.1792
## Detection Rate         0.2781   0.1758   0.1635   0.1536   0.1770
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9725   0.9637   0.9432   0.9668   0.9898

Prediction

The Random Forest model has higher accuracy on the validation set (97.1%) than the KNN model (94.8%). We will use it to predict on the testing data.
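
The same comparison can be made programmatically from the validation-set predictions (a sketch; output not shown):

# Sketch: validation-set accuracy of both models side by side
data.frame(model = c("Random Forest", "KNN"),
           accuracy = c(mean(pca_val$pred_rf == pca_val$classe),
                        mean(pca_val$pred_knn == pca_val$classe)))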

Remove NA columns from testing data

col_na_count_testing <- testing %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))

testing_filter <- testing %>% 
    select(-which(col_na_count_testing > 0.5 * nrow(testing)))

Apply the PCA models to the testing data

# Create pca variables with "belt"
testing_belt <- testing_filter %>% select(contains("belt"))
testing_belt_pca <- predict(training_belt_pcamodel, newdata = testing_belt)
names(testing_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")

# Create pca variables with "forearm"
testing_forearm <- testing_filter %>% select(contains("forearm"))
testing_forearm_pca <- predict(training_forearm_pcamodel, newdata = testing_forearm)
names(testing_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5", "forearm_pca6", "forearm_pca7", "forearm_pca8")

# Create pca variables with "_arm"
testing_arm <- testing_filter %>% select(contains("_arm"))
testing_arm_pca <- predict(training_arm_pcamodel, newdata = testing_arm)
names(testing_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5", "arm_pca6", "arm_pca7")

# Create pca variables with "dumbbell"
testing_dumbbell <- testing_filter %>% select(contains("dumbbell"))
testing_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = testing_dumbbell)
names(testing_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5", "dumbbell_pca6")

# Combine all groups of PCA variables

testing_pca <- data.frame(user_name = testing[, 2], testing_belt_pca, testing_arm_pca, testing_forearm_pca, testing_dumbbell_pca )

Make predictions using Random Forest

testing_pca$pred <- predict(model1, newdata = testing_pca, type = "raw")
testing_pca$pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E